Big data analysis and visualisation: Session 2

Acquiring, preparing, and visualizing data

J.M.T. Roos

Last updated: 2017-05-08 11:56:16

Review

  • In class
    • git — cloning, committing, pushing
    • R Markdown — mixing text with R code using ```{r} blocks
  • During the week
    • Reviewing/learning the R programming language
    • Installing packages

Applied problem: Merging samples

Repeated measures for 11 individuals, mean (sd)

Round  Duration   Number Correct
1      7.5 (2.0)  9.0 (3.3)
2      7.5 (2.0)  9.0 (3.3)
3      7.5 (2.0)  9.0 (3.3)
4      7.5 (2.0)  9.0 (3.3)

Applied problem: Merging samples

Regression of Duration on Number Correct repeated for each round

Round  Term         Estimate  SE
1      (Intercept)  3.0       1.12
       num.correct  0.5       0.12
2      (Intercept)  3.0       1.13
       num.correct  0.5       0.12
3      (Intercept)  3.0       1.12
       num.correct  0.5       0.12
4      (Intercept)  3.0       1.12
       num.correct  0.5       0.12

Remember…

Always look at the data first
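
The summary and regression tables above are consistent with Anscombe's quartet, which ships with R as the anscombe data frame. Assuming that is where the numbers come from (the slides do not say so), a minimal sketch reproduces the near-identical fits and then plots each round. The helper name fit_round is hypothetical.

```r
# Assumption: the four "rounds" correspond to the x1..x4 / y1..y4 column
# pairs of R's built-in anscombe data set (not stated in the slides).
fit_round <- function(i) {
  x <- anscombe[[paste0("x", i)]]  # plays the role of num.correct
  y <- anscombe[[paste0("y", i)]]  # plays the role of duration
  coef(lm(y ~ x))                  # intercept and slope for this round
}

# One column per round; rows are the intercept and the slope
coefs <- sapply(1:4, fit_round)
round(coefs, 2)

# The fits agree, but the scatter plots can still differ, so plot each round
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = "num.correct", ylab = "duration", main = paste("Round", i))
}
par(op)
```

Only base R is used here, so the sketch runs before ggplot2 is introduced.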

This session

  • Data visualization with ggplot2
    • Basics
    • In-class exercises
    • More advanced concepts
    • In-class exercises
  • Tidying and summarizing data with dplyr, reshape2, and tidyr
    • Single table operations
    • In-class exercises
    • Joins
    • In-class exercises
  • Acquiring (tabular) data (briefly covered)

ggplot2

  • Plotting package in R intended to replace the core plotting routines
  • Based on the concept of a grammar of graphics
    • Plots are constructed from simpler components, much as sentences are constructed from nouns, verbs, etc.
    • Not all arrangements of words lead to comprehensible sentences — the same is true for plots, and ggplot2 helps you avoid (visual) nonsense
    • This approach leads to a modularity of design, making it easy for programmers to extend
  • Sensible and aesthetically pleasing default settings
    • Informed by what we know about visual perception and cognition

What is a graph?

A visual display that illustrates one or more relationships among numbers…a shorthand means of presenting information that would take many more words and numbers to describe.

—Stephen M. Kosslyn. Graph Design for the Eye and Mind. Oxford University Press, 2006

It depends on the goal:

  • A tool for discovery — gain an overview of, convey the scale and complexity of, or facilitate an exploration of data (dataviz)
  • A tool for communication — help you to help others understand, tell a story about, or stimulate interest in a problem or solution (infographics)

At a minimum…

  • Graphs are for comparing quantities
    • Always ask yourself: “Comparing what?”
    • Insist on this comparison being obvious to the viewer (and yourself)
  • Each graph should answer a central question
    • Both the question and answer should be clear
    • Use the title, caption, and other labels to highlight both
      • For manuscripts, these text elements are excluded from your word count, and you should take advantage of that

Psychological principles (Kosslyn, 2006)

  • I will briefly cover some of what we know about human cognition of data visualizations
  • If you want to know more, the book by Kosslyn (see quote earlier) is a good reference
  • Book divides what we know into 8 principles, which I think fall into 3 buckets:
    • Connecting with the audience
    • Directing and holding attention
    • Promoting understanding and memory

Connecting with the audience

  1. Relevance
    • Not too much or too little information
    • Present information that reflects the message you want to convey
    • Don’t present extraneous information
  2. Appropriate knowledge
    • Prior knowledge must be sufficient to understand the graph
    • If you assume too much prior knowledge, viewers will be confused
    • If you violate norms, viewers will be confused

Directing and holding attention

  1. Salience
    • Attention is drawn to large perceptible differences
    • The most visually striking aspect receives the most attention
    • Annotations help direct viewers’ attention
  2. Discriminability
    • Properties must differ enough to be noticed
    • Defaults in ggplot2 do much of this work for you
  3. Organization
    • Groups of elements are seen and remembered as a whole

Understanding and memory

  1. Compatibility
    • Form should be aligned with meaning
    • Lines express continuous change, bars discrete quantities
    • More = more (higher, better, bigger, etc.)
  2. Informative changes
    • Changes in properties should carry information
    • …and vice versa
  3. Capacity limitations
    • If too much information is presented, none is remembered
    • Four chunks in working memory
    • Graph designers err on the side of presenting too much; graph readers err on the side of paying too little attention

ggplot2’s grammar

  • Decomposes graphs into basic parts
  • Sets rules for interactions among those parts
  • Helps us stay out of trouble

ggplot2’s grammar

  • Default values for Data and Mapping available to all layers
  • Layers — one or more, each with the following:
    • Data (overriding the default) — a data.frame
    • Mapping (overriding the default) of columns to Aesthetics
    • Geometry specifying what to draw
    • Statistic specifying how to transform the data before drawing
    • Position specifying how to arrange items
  • Facet specification for generating subplots
  • Scales specifying how to translate the data to lengths, colors, sizes, etc. in the graph
  • Coordinates: the default (Cartesian) is what you want 99% of the time, so ignore this for now
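
To make the parts of the grammar concrete, here is a sketch that spells out every component in one plot. It uses the mpg data set bundled with ggplot2, and the choice of columns is only for illustration. In everyday code you would write geom_point() rather than layer().

```r
library(ggplot2)

# mpg ships with ggplot2; displ, hwy and drv are picked only as an example
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +  # default data and mapping
  layer(geom = "point", stat = "identity",
        position = "identity") +   # one layer with geometry, statistic, position
  facet_wrap(~drv) +               # facet specification
  scale_y_continuous() +           # scale for the y aesthetic
  coord_cartesian()                # coordinates (already the default)

print(p)
```

Dropping the last three lines gives the same plot without facets, because the scale and coordinate system shown are the defaults.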

Layers

  • Layers contain everything we see, often showing different views of the same data

Test data

test_data
## # A tibble: 44 × 4
##     round respondent num.correct duration
##    <fctr>     <fctr>       <dbl>    <dbl>
## 1       1          1          10     8.04
## 2       1          2           8     6.95
## 3       1          3          13     7.58
## 4       1          4           9     8.81
## 5       1          5          11     8.33
## 6       1          6          14     9.96
## 7       1          7           6     7.24
## 8       1          8           4     4.26
## 9       1          9          12    10.84
## 10      1         10           7     4.82
## # ... with 34 more rows
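
The slides never show how test_data was built. The first rows printed above match the x1 and y1 columns of R's built-in anscombe data, so one plausible reconstruction, offered as an assumption rather than the original code, stacks the four column pairs into long form. The helper name make_round is hypothetical.

```r
# Assumption: test_data is Anscombe's quartet in long form; the slides
# do not show how it was actually created.
make_round <- function(i) {
  data.frame(
    round       = factor(i, levels = 1:4),
    respondent  = factor(1:11),
    num.correct = anscombe[[paste0("x", i)]],
    duration    = anscombe[[paste0("y", i)]]
  )
}

# Stack the four rounds on top of each other (4 x 11 = 44 rows)
test_data <- do.call(rbind, lapply(1:4, make_round))

head(test_data, 3)
```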

Defaults

  • Specify the defaults first
  • Most graphs use a single set of data (data.frame) for every layer
  • Most graphs use a single set of mappings between columns and aesthetics
my_plot <- ggplot(data = test_data, mapping = aes(x = duration,
    y = num.correct))
  • aes() is used to create a list of aesthetic mappings
    • x refers to the graph’s x-axis, y to the y-axis
    • duration \(\rightarrow\) x-axis
    • num.correct \(\rightarrow\) y-axis
  • my_plot now represents a ggplot object set to our defaults
  • You don’t need to name the arguments; data comes first, mapping comes second
my_plot <- ggplot(test_data, aes(x = duration, y = num.correct))

An empty plot

  • Defaults by themselves do nothing
print(my_plot)

  • By default, we get an “empty” plot
  • To see something, we need to specify a layer

Adding a layer

  • Use the + operator to combine ggplot elements
my_plot + geom_point()

  • R prints a plot automatically at the top level (console or R Markdown chunk), so you usually do not need print(); the following two lines are equivalent:
    my_plot + geom_point()
    print(my_plot + geom_point())

Each layer has a geometry

my_plot + geom_point()
my_plot + geom_line()

my_plot + geom_point() + geom_line()

Each layer has a statistic

  • Usually the statistic is the identity function, \(f(x)=x\); that is, the data are left unchanged
  • The default statistic for geom_point and geom_line is identity so these plots show the data as is
  • The default statistic for geom_histogram is a binning function (called stat_bin)
ggplot(test_data, aes(x = duration)) + geom_histogram(binwidth = 2)

  • stat_bin turns the 44 rows of test_data (shown earlier) into one row per bin of width 2, where x is the bin centre and y is the number of observations in that bin:
## # A tibble: 5 × 2
##       x     y
##   <dbl> <dbl>
## 1     4     4
## 2     6    13
## 3     8    20
## 4    10     5
## 5    12     2
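
If you want to inspect what a statistic computed, ggplot2's layer_data() returns the data a layer actually draws. The sketch below assumes, as a guess not confirmed by the slides, that the duration values are the y1 to y4 columns of R's built-in anscombe data.

```r
library(ggplot2)

# Assumption: duration corresponds to the y1..y4 columns of anscombe
test_data <- data.frame(
  duration = unlist(anscombe[paste0("y", 1:4)], use.names = FALSE)
)

p <- ggplot(test_data, aes(x = duration)) + geom_histogram(binwidth = 2)

# layer_data() returns the first layer's data after stat_bin has been applied
binned <- layer_data(p, 1)
binned[, c("x", "count")]
```

The count column should line up with the y column in the binned table on the slide.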

Geoms and statistics

  • Each geom/statistic has a default statistic/geom
Item            Default stat/geom
geom_point      stat_identity (\(f(x)=x\))
geom_line       stat_identity (\(f(x)=x\))
geom_histogram  stat_bin (binning)
geom_smooth     stat_smooth (regression)
stat_smooth     geom_smooth (line + ribbon)
stat_bin        geom_bar (vertical bars)
stat_identity   geom_point (dots)
  • Hence, these produce the same output:
    ggplot(test_data, aes(x = duration)) + stat_bin(binwidth = 1)
    ggplot(test_data, aes(x = duration)) + geom_histogram(binwidth = 1)

Data versus statistics

  • Be sure you understand: “Does this layer contain data or statistics?”
  • When in doubt, prefer data to statistics:
  • Example: A scatter plot of observations conveys more information than a box plot showing quantiles
ggplot(test_data, aes(x = round,
  y = duration)) + geom_point()

ggplot(test_data, aes(x = round,
  y = duration)) + geom_boxplot()

Aesthetics

  • Each geometry interacts with one or more aesthetics
Item             Required          Optional
geom_point       x, y              alpha, colour, fill, shape, size, stroke
geom_line        x, y              alpha, colour, linetype, size
geom_pointrange  x, y, ymax, ymin  alpha, colour, linetype, size
  • You can either map data to an aesthetic, or set it explicitly
my_plot + geom_point(
  mapping = aes(colour = round))

my_plot + geom_point(
  colour="red")

Position

  • Each layer also has a position specification
  • The default is again identity, meaning “don’t do anything special”
  • Examples: bars can be positioned with stack or dodge
g <- ggplot(test_data, aes(x = num.correct, fill = round))
g + stat_bin(binwidth = 4,
             position = 'stack')

g + stat_bin(binwidth = 4,
             position = 'dodge')

Practice with layers (Tasks 1–4)

  • Work with a neighbor
  • First discuss the task, then one of you does the typing (take turns for each task)
  • Discuss what you are doing as you write code
  • Write your code in an empty File > New File… > R Script and execute each line using Cmd-Enter (Mac) or Control-Enter (Windows)
  • Use the data set called mpg which is included in the ggplot2 package
  • Exercises can be found at http://jasonmtroos.com/assets/media/teaching/rook/session_2_in_class.html or http://goo.gl/Gx5LAK

Data

library(ggplot2) #library(tidyverse)
?mpg

Description

This dataset contains a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov. It contains only models which had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car.

Usage

mpg

Format

A data frame with 234 rows and 11 variables

  • manufacturer
  • model: model name
  • displ: engine displacement, in litres
  • year: year of manufacture
  • cyl: number of cylinders
  • trans: type of transmission
  • drv: f = front-wheel drive, r = rear wheel drive, 4 = 4wd
  • cty: city miles per gallon
  • hwy: highway miles per gallon
  • fl: fuel type
  • class: “type” of car

mpg
## # A tibble: 234 × 11
##    manufacturer      model displ  year   cyl      trans   drv   cty   hwy
##           <chr>      <chr> <dbl> <int> <int>      <chr> <chr> <int> <int>
## 1          audi         a4   1.8  1999     4   auto(l5)     f    18    29
## 2          audi         a4   1.8  1999     4 manual(m5)     f    21    29
## 3          audi         a4   2.0  2008     4 manual(m6)     f    20    31
## 4          audi         a4   2.0  2008     4   auto(av)     f    21    30
## 5          audi         a4   2.8  1999     6   auto(l5)     f    16    26
## 6          audi         a4   2.8  1999     6 manual(m5)     f    18    26
## 7          audi         a4   3.1  2008     6   auto(av)     f    18    27
## 8          audi a4 quattro   1.8  1999     4 manual(m5)     4    18    26
## 9          audi a4 quattro   1.8  1999     4   auto(l5)     4    16    25
## 10         audi a4 quattro   2.0  2008     4 manual(m6)     4    20    28
## # ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>

Task 0 (Example)

  • Create a plot with 1 layer:
    • x mapped to cty
    • y mapped to hwy
    • point geometry
    • identity stat
    • identity position
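
One way to write Task 0, as a sketch rather than the only valid answer. The stat and position arguments are spelled out only to mirror the task description; both are already the defaults for geom_point(). The object name task0 is hypothetical.

```r
library(ggplot2)

# x mapped to cty, y mapped to hwy, one point layer
task0 <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(stat = "identity", position = "identity")

print(task0)
```

Writing ggplot(mpg, aes(x = cty, y = hwy)) + geom_point() produces the same plot.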

Basic exercises (Tasks 1–4)

Facets and discrete groups

  • Two main options when comparing subsets of data
    • Each discrete set is given a different colour, shape, or size
    • Each discrete set is plotted in its own facet
g <- ggplot(mpg, aes(x = displ, y = hwy))
g + geom_point(aes(colour = drv))

g + geom_point() + facet_wrap(~drv)

Groups

  • When you map discrete variables to colour, shape, or size, ggplot2 automatically maps those variables to group
  • The group aesthetic controls how collections of items are rendered
    • In geom_line the group aesthetic determines which points will be connected by a continuous line
    • In stat_summary the group aesthetic determines which points are summarised by a common statistic
  • If a variable v is continuous but you want to use it for grouping, either specify group = v or transform it into a discrete variable, e.g., colour = factor(v)
ggplot(mpg, aes(x = displ, y = hwy,
              colour=cyl)) +
  geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess'

ggplot(mpg, aes(x = displ, y = hwy,
              colour=factor(cyl))) +
  geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess'

  • To override the automatic grouping, specify aes(group=1) when creating a layer
ggplot(mpg, aes(x = displ, y = hwy, colour = factor(cyl))) +
    geom_point() + geom_smooth(aes(group = 1))
## `geom_smooth()` using method = 'loess'

Scales

  • Scales apply to the entire plot, i.e., to every layer
  • ggplot2 can detect what type of scale you might want, but it isn’t perfect
  • For example, you might want a logarithmic scale instead of the default linear scale
ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() +
    scale_y_log10(breaks = c(15, 30, 45))

Labels

  • Always annotate graphs with a title and human-readable labels for each aesthetic
    • x- and y-axes
    • Legends and colour bars
ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) + geom_point() +
    scale_y_log10(breaks = c(15, 30, 45)) + labs(x = "Displacement (litres)",
    y = "Highway miles per gallon (log scale)",
    colour = "Drive train",
    title = "Engine size and fuel consumption")

Relabelling

ggplot(mpg, aes(x = displ, y = hwy, colour = plyr::revalue(drv,
    c(f = "Fore", r = "Rear", `4` = "4WD")))) + geom_point() +
    labs(colour = "Drive train")

ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() +
    facet_wrap(~drv, labeller = as_labeller(c(f = "Fore",
        r = "Rear", `4` = "4WD")))

Task 5

More reading

Tidying and summarizing data

  • Single table operations
  • Joins

dplyr

library(dplyr)
  • When working with data you must:
    • Figure out what you want to do.
    • Describe those tasks in the form of a computer program.
    • Execute the program.
  • The dplyr package makes these steps fast and easy:
    • By constraining your options, it simplifies how you can think about common data manipulation tasks.
    • It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate those thoughts into code.
    • It uses efficient data storage backends, so you spend less time waiting for the computer.

Source: Introduction to dplyr vignette

Pipe operator

(e <- exp(1))
## [1] 2.718282
log(e)
## [1] 1

Usage: log(x, base = exp(1))

e %>% log
## [1] 1
e %>% log()
## [1] 1
e %>% log(.)
## [1] 1
e %>% log(2)
## [1] 1.442695
e %>% log(base = 2)
## [1] 1.442695
e %>% log(., base = 2)
## [1] 1.442695

Little bunny Foo Foo
Went hopping through the forest
Scooping up the field mice
And bopping them on the head

bop(
  scoop(
    hop(foo_foo, through = forest),
    up = field_mice
  ),
  on = head
)
foo_foo %>%
  hop(through = forest) %>%
  scoop(up = field_mice) %>%
  bop(on = head)

Single table operations

  • Receive a data frame as input
  • Return a data frame as output
    • Input data frame is unchanged
select
rename
mutate
arrange
filter
summarise
group_by
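
The slides that follow apply each verb to a small data frame called d, which is never defined in the text. A hypothetical reconstruction from the four rows printed on those slides lets you run the examples yourself and check that the input is left unchanged.

```r
library(dplyr)

# Hypothetical reconstruction of d from the four rows shown on the slides
d <- tibble(
  cty   = c(11, 20, 11, 17),
  hwy   = c(17, 26, 15, 24),
  cyl   = c(6, 4, 8, 6),
  displ = c(3.3, 2.5, 4.6, 3)
)

two_cols <- d %>% select(cty, hwy)          # a new data frame with two columns
thirsty  <- d %>% filter(hwy / cty > 1.4)   # a new data frame with fewer rows

names(d)  # d itself still has all four columns
```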

Select a subset of columns

d %>% select(cty, hwy)
cty hwy cyl displ
11 17 6 3.3
20 26 4 2.5
11 15 8 4.6
17 24 6 3
cty hwy
11 17
20 26
11 15
17 24
d %>% select(starts_with("c"))
cty hwy cyl displ
11 17 6 3.3
20 26 4 2.5
11 15 8 4.6
17 24 6 3
cty cyl
11 6
20 4
11 8
17 6

Rename or reorder columns

d %>% select(highway = hwy, everything(), -cyl)
cty hwy cyl displ
11 17 6 3.3
20 26 4 2.5
11 15 8 4.6
17 24 6 3
highway cty displ
17 11 3.3
26 20 2.5
15 11 4.6
24 17 3
d %>% rename(highway = hwy)
cty hwy cyl displ
11 17 6 3.3
20 26 4 2.5
11 15 8 4.6
17 24 6 3
cty highway cyl displ
11 17 6 3.3
20 26 4 2.5
11 15 8 4.6
17 24 6 3

Create new columns

d %>% mutate(z = hwy/cty)
cty hwy cyl displ
11 17 6 3.3
20 26 4 2.5
11 15 8 4.6
17 24 6 3
cty hwy cyl displ z
11 17 6 3.3 1.545455
20 26 4 2.5 1.3
11 15 8 4.6 1.363636
17 24 6 3 1.411765
d %>% mutate(sqrt(displ))
cty hwy cyl displ
11 17 6 3.3
20 26 4 2.5
11 15 8 4.6
17 24 6 3
cty hwy cyl displ sqrt(displ)
11 17 6 3.3 1.81659
20 26 4 2.5 1.581139
11 15 8 4.6 2.144761
17 24 6 3 1.732051

Sort rows

d %>% arrange(cty, hwy)
cty hwy cyl displ
11 17 6 3.3
20 26 4 2.5
11 15 8 4.6
17 24 6 3
cty hwy cyl displ
11 15 8 4.6
11 17 6 3.3
17 24 6 3
20 26 4 2.5
d %>% arrange(desc(cty), hwy)
cty hwy cyl displ
11 17 6 3.3
20 26 4 2.5
11 15 8 4.6
17 24 6 3
cty hwy cyl displ
20 26 4 2.5
17 24 6 3
11 15 8 4.6
11 17 6 3.3

Keep a subset of rows

d %>% filter(cty == 11)
cty hwy cyl displ
11 17 6 3.3
20 26 4 2.5
11 15 8 4.6
17 24 6 3
cty hwy cyl displ
11 17 6 3.3
11 15 8 4.6
d %>% filter(hwy/cty > 1.4)
cty hwy cyl displ
11 17 6 3.3
20 26 4 2.5
11 15 8 4.6
17 24 6 3
cty hwy cyl displ
11 17 6 3.3
17 24 6 3

Summarise data

d %>% summarise(hwy = mean(hwy), cty = mean(cty))
cty hwy cyl displ
11 17 6 3.3
20 26 4 2.5
11 15 8 4.6
17 24 6 3
hwy cty
20.5 14.75
d %>% summarise_each(funs(mean))
cty hwy cyl displ
11 17 6 3.3
20 26 4 2.5
11 15 8 4.6
17 24 6 3
cty hwy cyl displ
14.75 20.5 6 3.35

Grouping operations

With summarise

d %>% group_by(cyl) %>% summarise_each(funs(mean))
cty hwy cyl displ
11 17 6 3.3
20 26 4 2.5
11 15 8 4.6
17 24 6 3
cyl cty hwy displ
4 20 26 2.5
6 14 20.5 3.15
8 11 15 4.6
d %>% group_by(cty) %>% summarise(mean(hwy), n())
cty hwy cyl displ
11 17 6 3.3
20 26 4 2.5
11 15 8 4.6
17 24 6 3
cty mean(hwy) n()
11 16 2
17 24 1
20 26 1
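
summarise_each() and funs() have since been deprecated in dplyr. Assuming dplyr 1.0.0 or later is installed, across() is a possible replacement for both calls shown above. The data frame d is again a hypothetical reconstruction of the four rows on the slides.

```r
library(dplyr)

# Hypothetical reconstruction of d from the four rows shown on the slides
d <- tibble(
  cty   = c(11, 20, 11, 17),
  hwy   = c(17, 26, 15, 24),
  cyl   = c(6, 4, 8, 6),
  displ = c(3.3, 2.5, 4.6, 3)
)

# Counterpart of d %>% summarise_each(funs(mean))
overall <- d %>% summarise(across(everything(), mean))

# Counterpart of d %>% group_by(cyl) %>% summarise_each(funs(mean));
# the grouping column cyl is left out of everything() automatically
by_cyl <- d %>% group_by(cyl) %>% summarise(across(everything(), mean))

overall
by_cyl
```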

Grouping operations

With mutate

d %>% group_by(cyl) %>% mutate(max(hwy))
cty hwy cyl displ
11 17 6 3.3
20 26 4 2.5
11 15 8 4.6
17 24 6 3
cty hwy cyl displ max(hwy)
11 17 6 3.3 24
20 26 4 2.5 26
11 15 8 4.6 15
17 24 6 3 24
d %>% group_by(cty) %>% mutate(displ = displ - mean(displ))
cty hwy cyl displ
11 17 6 3.3
20 26 4 2.5
11 15 8 4.6
17 24 6 3
cty hwy cyl displ
11 17 6 -0.65
20 26 4 0
11 15 8 0.65
17 24 6 0

Grouping operations

e %>% group_by(manufacturer, model) %>% summarise(cty = mean(cty),
    n = n()) %>% filter(cty == max(cty)) %>% rename(max_cty = cty)
manufacturer model cty
audi a4 18
audi a4 21
audi a4 20
audi a4 21
audi a4 16
audi a4 18
audi a4 18
audi a4 quattro 18
audi a4 quattro 16
audi a4 quattro 20
audi a4 quattro 19
audi a4 quattro 15
audi a4 quattro 17
manufacturer model max_cty n
audi a4 18.85714 7
chevrolet malibu 18.80000 5
dodge caravan 2wd 15.81818 11
ford mustang 15.88889 9
honda civic 24.44444 9
hyundai sonata 19.00000 7
jeep grand cherokee 4wd 13.50000 8
land rover range rover 11.50000 4
lincoln navigator 2wd 11.33333 3
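
Why does filter(cty == max(cty)) keep one model per manufacturer rather than one row overall? summarise() removes only the last grouping variable (model), so its result is still grouped by manufacturer and max() is evaluated within each manufacturer. The sketch below assumes that e on the slide holds the same rows as mpg, which the slides do not state.

```r
library(dplyr)
data(mpg, package = "ggplot2")

# Assumption: e on the slide holds the same rows as mpg
best <- mpg %>%
  group_by(manufacturer, model) %>%
  summarise(cty = mean(cty), n = n()) %>%  # result is still grouped by manufacturer
  filter(cty == max(cty)) %>%              # so max() is taken within each manufacturer
  rename(max_cty = cty)

group_vars(best)  # shows which grouping variables remain
best
```

Adding ungroup() before filter() would instead compare every model against the single overall maximum.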

Separating and uniting columns

library(tidyr)
e %>% separate(trans, c("type", "detail"), sep = "[\\(\\)]",
    extra = "drop", remove = TRUE)
model year trans
a4 1999 auto(l5)
a4 1999 manual(m5)
a4 2008 manual(m6)
a4 2008 auto(av)
a4 quattro 1999 manual(m5)
a4 quattro 1999 auto(l5)
a4 quattro 2008 manual(m6)
a4 quattro 2008 auto(s6)
a6 quattro 1999 auto(l5)
model year type detail
a4 1999 auto l5
a4 1999 manual m5
a4 2008 manual m6
a4 2008 auto av
a4 quattro 1999 manual m5
a4 quattro 1999 auto l5
a4 quattro 2008 manual m6
a4 quattro 2008 auto s6
a6 quattro 1999 auto l5
  • The inverse to separate is unite
f %>% unite(trans, type, detail, sep = "_")
model year type detail
a4 1999 auto l5
a4 1999 manual m5
a4 2008 manual m6
a4 2008 auto av
a4 quattro 1999 manual m5
a4 quattro 1999 auto l5
a4 quattro 2008 manual m6
a4 quattro 2008 auto s6
a6 quattro 1999 auto l5
model year trans
a4 1999 auto_l5
a4 1999 manual_m5
a4 2008 manual_m6
a4 2008 auto_av
a4 quattro 1999 manual_m5
a4 quattro 1999 auto_l5
a4 quattro 2008 manual_m6
a4 quattro 2008 auto_s6
a6 quattro 1999 auto_l5
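
The data frames e and f are not defined on these slides. A minimal round trip on a hypothetical two-row stand-in shows how the two calls fit together: the regular expression splits on either parenthesis, and extra = "drop" discards the empty piece left after the closing one.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for e, with values in the same style as the slide
e <- tibble(
  model = c("a4", "a4"),
  year  = c(1999L, 2008L),
  trans = c("auto(l5)", "manual(m6)")
)

# Split trans into two columns
f <- e %>% separate(trans, c("type", "detail"), sep = "[\\(\\)]",
                    extra = "drop", remove = TRUE)

# Glue them back together with an underscore
g <- f %>% unite(trans, type, detail, sep = "_")

f
g
```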

Wide to long

dw %>% gather(type, mpg, cty, hwy)
model displ trans cty hwy
a4 2 m6 20 31
a4 2 av 21 30
a4 3.1 av 18 27
a4q 2 m6 20 28
a4q 2 s6 19 27
a4q 3.1 s6 17 25
a4q 3.1 m6 15 25
a6q 3.1 s6 17 25
a6q 4.2 s6 16 23
model displ trans type mpg
a4 2.0 m6 cty 20
a4 2.0 av cty 21
a4 3.1 av cty 18
a4q 2.0 m6 cty 20
a4q 2.0 s6 cty 19
a4q 3.1 s6 cty 17
a4q 3.1 m6 cty 15
a6q 3.1 s6 cty 17
a6q 4.2 s6 cty 16
a4 2.0 m6 hwy 31
a4 2.0 av hwy 30
a4 3.1 av hwy 27
a4q 2.0 m6 hwy 28

Long to wide

dl %>% spread(type, mpg)
model displ trans type mpg
a4 2.0 m6 cty 20
a4 2.0 av cty 21
a4 3.1 av cty 18
a4q 2.0 m6 cty 20
a4q 2.0 s6 cty 19
a4q 3.1 s6 cty 17
a4q 3.1 m6 cty 15
a6q 3.1 s6 cty 17
a6q 4.2 s6 cty 16
a4 2.0 m6 hwy 31
a4 2.0 av hwy 30
a4 3.1 av hwy 27
a4q 2.0 m6 hwy 28
model displ trans cty hwy
a4 2 av 21 30
a4 2 m6 20 31
a4 3.1 av 18 27
a4q 2 m6 20 28
a4q 2 s6 19 27
a4q 3.1 m6 15 25
a4q 3.1 s6 17 25
a6q 3.1 s6 17 25
a6q 4.2 s6 16 23
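
gather() and spread() still work, but tidyr 1.0.0 and later mark them as superseded by pivot_longer() and pivot_wider(). Assuming a recent tidyr, the two reshapes above could be sketched as follows on a hypothetical three-row stand-in for dw.

```r
library(tidyr)

# Hypothetical three-row stand-in for dw
dw <- data.frame(
  model = c("a4", "a4", "a4q"),
  displ = c(2, 2, 2),
  trans = c("m6", "av", "m6"),
  cty   = c(20, 21, 20),
  hwy   = c(31, 30, 28)
)

# Wide to long: counterpart of gather(type, mpg, cty, hwy)
dl <- pivot_longer(dw, cols = c(cty, hwy),
                   names_to = "type", values_to = "mpg")

# Long to wide: counterpart of spread(type, mpg)
back <- pivot_wider(dl, names_from = type, values_from = mpg)

dl
back
```

Note that pivot_longer() keeps the cty and hwy rows of each car next to each other, whereas gather() stacks all cty rows before all hwy rows.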

Single table exercises (Tasks 6–11)

library(dplyr)
library(tidyr)
data(mpg, package = "ggplot2")

Joins

## # A tibble: 5 × 2
##     sid  name
##   <dbl> <chr>
## 1   100   Ann
## 2   101   Bob
## 3   102   Cam
## 4   103   Dee
## 5   104   Els
## # A tibble: 7 × 3
##     sid grade course
##   <dbl> <dbl>  <chr>
## 1   100   8.0    A94
## 2   101   6.5    A94
## 3   103   7.0    A94
## 4   100   9.0    B90
## 5   103   5.5    B90
## 6   102   7.5    C14
## 7    90   7.0    C14
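
The two tables printed above can be recreated as follows. The names students and grades are taken from the join calls used later in the slides; the values are copied from the printed output.

```r
library(dplyr)

students <- tibble(
  sid  = c(100, 101, 102, 103, 104),
  name = c("Ann", "Bob", "Cam", "Dee", "Els")
)

grades <- tibble(
  sid    = c(100, 101, 103, 100, 103, 102, 90),
  grade  = c(8, 6.5, 7, 9, 5.5, 7.5, 7),
  course = c("A94", "A94", "A94", "B90", "B90", "C14", "C14")
)

# Keys that appear in only one of the two tables
setdiff(students$sid, grades$sid)  # student without any grade
setdiff(grades$sid, students$sid)  # grade without a matching student
```

Those unmatched keys are what make the different join types return different numbers of rows.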

Inner join

  • Only rows that match between the two tables
inner_join(students, grades)
## Joining, by = "sid"
## # A tibble: 6 × 4
##     sid  name grade course
##   <dbl> <chr> <dbl>  <chr>
## 1   100   Ann   8.0    A94
## 2   100   Ann   9.0    B90
## 3   101   Bob   6.5    A94
## 4   102   Cam   7.5    C14
## 5   103   Dee   7.0    A94
## 6   103   Dee   5.5    B90
  • sid exists in both tables so is assumed to be a key column
  • Same as
    students %>% inner_join(grades)
    students %>% inner_join(grades, by = "sid")

Left/right outer joins

  • All rows from the “left”/“right” table, even if there are no matching rows from the other
students %>% left_join(grades)
## Joining, by = "sid"
## # A tibble: 7 × 4
##     sid  name grade course
##   <dbl> <chr> <dbl>  <chr>
## 1   100   Ann   8.0    A94
## 2   100   Ann   9.0    B90
## 3   101   Bob   6.5    A94
## 4   102   Cam   7.5    C14
## 5   103   Dee   7.0    A94
## 6   103   Dee   5.5    B90
## 7   104   Els    NA   <NA>
students %>% right_join(grades)
## Joining, by = "sid"
## # A tibble: 7 × 4
##     sid  name grade course
##   <dbl> <chr> <dbl>  <chr>
## 1   100   Ann   8.0    A94
## 2   101   Bob   6.5    A94
## 3   103   Dee   7.0    A94
## 4   100   Ann   9.0    B90
## 5   103   Dee   5.5    B90
## 6   102   Cam   7.5    C14
## 7    90  <NA>   7.0    C14

Full outer join

  • All rows from each table
students %>% full_join(grades)
## Joining, by = "sid"
## # A tibble: 8 × 4
##     sid  name grade course
##   <dbl> <chr> <dbl>  <chr>
## 1   100   Ann   8.0    A94
## 2   100   Ann   9.0    B90
## 3   101   Bob   6.5    A94
## 4   102   Cam   7.5    C14
## 5   103   Dee   7.0    A94
## 6   103   Dee   5.5    B90
## 7   104   Els    NA   <NA>
## 8    90  <NA>   7.0    C14

Join exercises (Tasks 12–14)

install.packages("nycflights13")
library(nycflights13)

Reading data

  • Focus in this course is on web scraping and API calls (covered in next session)
  • Here I will briefly mention three packages that are used to read tabular data
    • If you have specific questions or need examples, let me know…

readr

library(readr)
  • For reading delimited text files representing tabular (i.e. rectangular) data
  • readr::read_csv instead of base::read.csv
  • Very fast, with better defaults, good detection of special data (e.g., dates)
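
A small sketch of the date detection mentioned above. The file contents and the column names day and visits are made up for illustration; the file is written to a temporary location so the example is self-contained.

```r
library(readr)

# Write a tiny CSV to a temporary file
path <- tempfile(fileext = ".csv")
writeLines(c("day,visits", "2017-05-08,12", "2017-05-09,15"), path)

# Column types are guessed from the file contents
visits <- read_csv(path)

visits
class(visits$day)
```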

readxl and haven

install.packages("readxl", dependencies = TRUE)
library(readxl)
  • For reading Excel spreadsheets
install.packages("haven", dependencies = TRUE)
library(haven)
  • For reading SAS, Stata, and SPSS data files
  • Has some limited ability to write to these formats as well
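
A minimal sketch of both packages, assuming they are installed. The readxl part reads an example workbook that I believe ships with the package (located via readxl_example()); the haven part writes a made-up data frame to an SPSS file and reads it back.

```r
library(readxl)
library(haven)

# readxl: read the first sheet of a bundled example workbook
xlsx_path <- readxl_example("datasets.xlsx")
sheet1 <- read_excel(xlsx_path)

# haven: round trip through an SPSS .sav file (made-up data)
scores <- data.frame(id = c(1, 2, 3), score = c(7.5, 6, 9))
sav_path <- tempfile(fileext = ".sav")
write_sav(scores, sav_path)
back <- read_sav(sav_path)

head(sheet1, 3)
back
```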

This week: Preliminary work for next week’s session (Twitter)

  • Follow this tutorial which provides detailed instructions for setting up programmatic access to Twitter.
  • Specifically, do the following before our next session (but follow the detailed instructions in the tutorial):
    1. install.packages('twitteR', dependencies = TRUE)
    2. Create a Twitter account if you do not already have one
    3. Visit the Twitter apps site and create a new app
    4. Create and record the four variables needed to access the Twitter API, and insert them into the code below to verify everything is working without error
library(twitteR)
setup_twitter_oauth("your_consumer_key", "your_consumer_secret",
    "your_access_token", "your_access_secret")
searchTwitter(searchString = "#hashtag", n = 100, lang = "en",
    since = NULL, until = NULL, locale = NULL, geocode = NULL,
    sinceID = NULL, maxID = NULL, resultType = "recent",
    retryOnRateLimit = 120)

This week: Preliminary work for next week’s session (HTML)

  • Do all of the following before our next session
  • Make sure rvest is installed:
install.packages("rvest", dependencies = TRUE)
  • Play this game to learn CSS
  • Install SelectorGadget for Chrome and try it, or else do a lot of playing around with “Inspect Element” in your favorite browser
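
To see why CSS selectors matter for scraping, here is a small sketch with rvest. The HTML snippet is made up; in practice you would pass a URL to read_html() and find the selector with SelectorGadget or “Inspect Element”.

```r
library(rvest)

# A made-up page, supplied as a literal string instead of a URL
page <- read_html('<html><body>
  <h2 class="title">First</h2>
  <p>Some text</p>
  <h2 class="title">Second</h2>
</body></html>')

# "h2.title" is a CSS selector: every <h2> element with class "title"
titles <- page %>% html_nodes("h2.title") %>% html_text()

titles
```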

This week: Write some code and share it

  • Be sure readr is installed:
install.packages("readr", dependencies = TRUE)
  • Create a new repository on github and then create a corresponding project in R (follow instructions from Session 1 slides)
  • Create an R Markdown file, and in that file, download, clean up, and visualize some data. Download data that would be relevant to your research — if you need inspiration, visit this list of awesome public datasets
  • Email the URL of your repository to another student (e.g., https://github.com/jasonmtroos/rook)
  • Also email the URL of your repository to me

This week: Run somebody else’s code and send them feedback

  • After you receive an email from another student, clone their repository
$ cd "name of your git workspace folder goes here"
$ git clone "url to your colleague's github repository"
  • Locate and open the .Rproj file and try to knit their R Markdown file
  • Did it work? If so, great!
    • Send them feedback about their R Markdown file
    • Did you think it was readable?
    • Did the code make sense to you before you ran it, or not at all?

This week: Want more practice?